In [6]:
from google.colab import drive
drive.mount("/content/drive")
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
In [7]:
!pip install h2o
Requirement already satisfied: h2o in /usr/local/lib/python3.7/dist-packages (3.34.0.3)
Requirement already satisfied: requests in /usr/local/lib/python3.7/dist-packages (from h2o) (2.23.0)
Requirement already satisfied: future in /usr/local/lib/python3.7/dist-packages (from h2o) (0.16.0)
Requirement already satisfied: tabulate in /usr/local/lib/python3.7/dist-packages (from h2o) (0.8.9)
Requirement already satisfied: chardet<4,>=3.0.2 in /usr/local/lib/python3.7/dist-packages (from requests->h2o) (3.0.4)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.7/dist-packages (from requests->h2o) (1.24.3)
Requirement already satisfied: idna<3,>=2.5 in /usr/local/lib/python3.7/dist-packages (from requests->h2o) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.7/dist-packages (from requests->h2o) (2021.5.30)
In [8]:
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Import data
f = "/content/drive/My Drive/bug_pred.csv"
df = h2o.import_file(f)

# Response column
y = "defects"

# Split into train & test
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]

# Run AutoML for 1 minute
aml = H2OAutoML(max_runtime_secs=60, seed=1)
aml.train(y=y, training_frame=train)

# Explain leader model & compare with all AutoML models
exa = aml.explain(test)

# Explain a single H2O model (e.g. leader model from AutoML)
exm = aml.leader.explain(test)

# Explain a generic list of models
# use h2o.explain as follows:
# exl = h2o.explain(model_list, test)

#       1. loc             : numeric % McCabe's line count of code
#       2. v(g)            : numeric % McCabe "cyclomatic complexity"
#       3. ev(g)           : numeric % McCabe "essential complexity"
#       4. iv(g)           : numeric % McCabe "design complexity"
#       5. n               : numeric % Halstead total operators + operands
#       6. v               : numeric % Halstead "volume"
#       7. l               : numeric % Halstead "program length"
#       8. d               : numeric % Halstead "difficulty"
#       9. i               : numeric % Halstead "intelligence"
#      10. e               : numeric % Halstead "effort"
#      11. b               : numeric % Halstead 
#      12. t               : numeric % Halstead's time estimator
#      13. lOCode          : numeric % Halstead's line count
#      14. lOComment       : numeric % Halstead's count of lines of comments
#      15. lOBlank         : numeric % Halstead's count of blank lines
#      16. lOCodeAndComment: numeric
#      17. uniq_Op         : numeric % unique operators
#      18. uniq_Opnd       : numeric % unique operands
#      19. total_Op        : numeric % total operators
#      20. total_Opnd      : numeric % total operands
#      21. branchCount     : numeric % of the flow graph
#      22. defects         : {false,true} % module has/has not one or more 
#                                         % reported defects
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O_cluster_uptime: 2 mins 57 secs
H2O_cluster_timezone: Etc/UTC
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.34.0.3
H2O_cluster_version_age: 28 days, 23 hours and 22 minutes
H2O_cluster_name: H2O_from_python_unknownUser_a1bber
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.162 Gb
H2O_cluster_total_cores: 2
H2O_cluster_allowed_cores: 2
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 3.7.12 final
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |███████████████████████████████████████████████████████████████| (done) 100%

Leaderboard

The leaderboard shows models with their metrics. When provided with an H2OAutoML object, it shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings); otherwise it shows metrics computed on the provided frame. At most 20 models are shown by default.
model_id | auc | logloss | aucpr | mean_per_class_error | rmse | mse | training_time_ms | predict_time_per_row_ms | algo
StackedEnsemble_BestOfFamily_4_AutoML_2_20211105_110657 | 0.814457 | 0.245272 | 0.221993 | 0.224564 | 0.270861 | 0.0733656 | 447 | 0.053338 | StackedEnsemble
GBM_grid_1_AutoML_2_20211105_110657_model_3 | 0.811326 | 0.240765 | 0.276307 | 0.26003 | 0.263394 | 0.0693766 | 126 | 0.020772 | GBM
GBM_grid_1_AutoML_2_20211105_110657_model_5 | 0.806839 | 0.254764 | 0.212941 | 0.312257 | 0.273922 | 0.0750332 | 212 | 0.01491 | GBM
StackedEnsemble_BestOfFamily_3_AutoML_2_20211105_110657 | 0.805146 | 0.2482 | 0.198759 | 0.211021 | 0.272885 | 0.0744661 | 337 | 0.036874 | StackedEnsemble
StackedEnsemble_BestOfFamily_2_AutoML_2_20211105_110657 | 0.802184 | 0.240283 | 0.201232 | 0.246699 | 0.268407 | 0.0720426 | 343 | 0.028962 | StackedEnsemble
DRF_1_AutoML_2_20211105_110657 | 0.796259 | 0.262609 | 0.237707 | 0.242932 | 0.276589 | 0.0765014 | 146 | 0.015528 | DRF
DeepLearning_1_AutoML_2_20211105_110657 | 0.795835 | 0.392299 | 0.234921 | 0.300872 | 0.289422 | 0.083765 | 111 | 0.011175 | DeepLearning
StackedEnsemble_AllModels_1_AutoML_2_20211105_110657 | 0.795412 | 0.249444 | 0.194681 | 0.208651 | 0.272874 | 0.0744604 | 431 | 0.052121 | StackedEnsemble
StackedEnsemble_AllModels_2_AutoML_2_20211105_110657 | 0.794735 | 0.249774 | 0.19746 | 0.238742 | 0.274122 | 0.0751431 | 437 | 0.055642 | StackedEnsemble
XGBoost_grid_1_AutoML_2_20211105_110657_model_3 | 0.79118 | 0.262332 | 0.230944 | 0.309463 | 0.276295 | 0.0763388 | 217 | 0.014467 | XGBoost
GBM_3_AutoML_2_20211105_110657 | 0.790503 | 0.263489 | 0.196818 | 0.31505 | 0.276093 | 0.0762273 | 189 | 0.020172 | GBM
GBM_grid_1_AutoML_2_20211105_110657_model_1 | 0.790249 | 0.255074 | 0.239563 | 0.270019 | 0.271006 | 0.0734441 | 121 | 0.029482 | GBM
XGBoost_grid_1_AutoML_2_20211105_110657_model_7 | 0.789318 | 0.24896 | 0.210234 | 0.291519 | 0.265976 | 0.0707431 | 112 | 0.010807 | XGBoost
XGBoost_grid_1_AutoML_2_20211105_110657_model_8 | 0.788217 | 0.328012 | 0.194577 | 0.26604 | 0.294134 | 0.086515 | 138 | 0.012883 | XGBoost
XGBoost_3_AutoML_2_20211105_110657 | 0.780472 | 0.265368 | 0.206416 | 0.246699 | 0.276454 | 0.0764271 | 446 | 0.013474 | XGBoost
GBM_grid_1_AutoML_2_20211105_110657_model_4 | 0.780388 | 0.251617 | 0.238559 | 0.29173 | 0.267124 | 0.0713554 | 189 | 0.019175 | GBM
XGBoost_grid_1_AutoML_2_20211105_110657_model_10 | 0.77933 | 0.315134 | 0.178319 | 0.294735 | 0.29535 | 0.0872314 | 194 | 0.012541 | XGBoost
GBM_2_AutoML_2_20211105_110657 | 0.778399 | 0.261371 | 0.20568 | 0.295497 | 0.272749 | 0.0743919 | 191 | 0.02372 | GBM
XGBoost_grid_1_AutoML_2_20211105_110657_model_2 | 0.774843 | 0.257768 | 0.193585 | 0.273997 | 0.271995 | 0.0739815 | 159 | 0.011466 | XGBoost
XGBoost_grid_1_AutoML_2_20211105_110657_model_5 | 0.766675 | 0.258487 | 0.18156 | 0.311072 | 0.268888 | 0.0723009 | 116 | 0.009935 | XGBoost

Confusion Matrix

Confusion matrix shows a predicted class vs an actual class.

StackedEnsemble_BestOfFamily_4_AutoML_2_20211105_110657

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.15450129064581739: 
      | false | true | Error  | Rate
false | 76    | 15   | 0.1648 | (15.0/91.0)
true  | 5     | 11   | 0.3125 | (5.0/16.0)
Total | 81    | 26   | 0.1869 | (20.0/107.0)
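As a sanity check, the Error column above can be reproduced from the cell counts (plain Python, not an H2O call):

```python
# Counts copied from the confusion matrix above: rows are actual classes,
# keys inside each row are predicted classes
cm = {
    "false": {"false": 76, "true": 15},
    "true":  {"false": 5,  "true": 11},
}

def row_error(actual):
    # Fraction of this actual class that was misclassified
    row = cm[actual]
    wrong = sum(n for pred, n in row.items() if pred != actual)
    return wrong / sum(row.values())

err_false = row_error("false")      # 15/91
err_true = row_error("true")        # 5/16
err_total = (15 + 5) / (91 + 16)    # 20/107
print(round(err_false, 4), round(err_true, 4), round(err_total, 4))
# 0.1648 0.3125 0.1869
```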

Variable Importance

The variable importance plot shows the relative importance of the most important variables in the model.

Variable Importance Heatmap

The variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g. Deep Learning, XGBoost). So that the variable importance of categorical columns can be compared across all model types, we compute a summarization of the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.
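A minimal sketch of that summarization, assuming hypothetical one-hot column names such as weathersit.1 (the actual encoded names depend on the model):

```python
# Sum the importances of one-hot encoded columns back into the original feature
raw_importance = {
    "temp": 0.40,
    "weathersit.1": 0.10,
    "weathersit.2": 0.05,
    "weathersit.3": 0.15,
    "hum": 0.30,
}

def aggregate_one_hot(importance):
    agg = {}
    for name, value in importance.items():
        base = name.split(".")[0]  # strip the one-hot level suffix
        agg[base] = agg.get(base, 0.0) + value
    return agg

combined = aggregate_one_hot(raw_importance)
# "weathersit" now carries the summed importance of its three encoded levels
```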

Model Correlation

This plot shows the correlation between the predictions of the models. For classification, frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit are highlighted using red colored text.
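For classification, the "frequency of identical predictions" can be sketched directly (toy predictions, not output from the models above):

```python
# Agreement between two models = fraction of rows with identical predicted class
preds_a = ["true", "false", "false", "true", "false"]
preds_b = ["true", "false", "true", "true", "false"]

agreement = sum(a == b for a, b in zip(preds_a, preds_b)) / len(preds_a)
print(agreement)  # 4 of 5 predictions match -> 0.8
```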

SHAP Summary

SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
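The additivity property stated above can be checked with made-up numbers (these contributions are illustrative, not taken from the models here):

```python
# For one row: the bias term plus the per-feature contributions
# reconstructs the model's raw prediction
contributions = {"temp": 0.5, "hum": -0.25, "windspeed": -0.125}
bias = 2.0

raw_prediction = bias + sum(contributions.values())
print(raw_prediction)  # 2.0 + 0.5 - 0.25 - 0.125 = 2.125
```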

Partial Dependence Plots

A partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. The PDP assumes independence between the feature for which the PDP is computed and the rest of the features.
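The computation behind a PDP can be sketched with a toy model (illustrative only; H2O does this internally over a grid of values):

```python
# To get the partial dependence at a grid value: fix the feature of interest
# at that value for every row, predict, and average the predictions
rows = [
    {"temp": 0.3, "hum": 0.5},
    {"temp": 0.6, "hum": 0.25},
    {"temp": 0.9, "hum": 0.75},
]

def toy_model(row):  # stand-in for a trained model's prediction on one row
    return 100 * row["temp"] - 20 * row["hum"]

def partial_dependence(feature, grid_value):
    preds = [toy_model(dict(row, **{feature: grid_value})) for row in rows]
    return sum(preds) / len(preds)

print(partial_dependence("temp", 0.5))  # (40 + 45 + 35) / 3 = 40.0
```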

Confusion Matrix

Confusion matrix shows a predicted class vs an actual class.

StackedEnsemble_BestOfFamily_4_AutoML_2_20211105_110657

Confusion Matrix (Act/Pred) for max f1 @ threshold = 0.15450129064581739: 
      | false | true | Error  | Rate
false | 76    | 15   | 0.1648 | (15.0/91.0)
true  | 5     | 11   | 0.3125 | (5.0/16.0)
Total | 81    | 26   | 0.1869 | (20.0/107.0)

Partial Dependence Plots

A partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. The PDP assumes independence between the feature for which the PDP is computed and the rest of the features.

Task 1 & 2¶

Try to explain the relation between the attributes and the prediction on the Bike rental dataset¶

Dataset: http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset¶

You can use the code from the above blocks.¶

1. Importing Required Libraries¶

In [1]:
import h2o
from h2o.automl import H2OAutoML
import matplotlib
import pandas as pd

2. Initializing H2O Instance¶

In [3]:
h2o.init()
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O_cluster_uptime: 1 hour 9 mins
H2O_cluster_timezone: Asia/Kolkata
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.36.0.3
H2O_cluster_version_age: 18 days
H2O_cluster_name: H2O_from_python_Venkat_sm0uc4
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 3.673 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
Python_version: 3.10.2 final

3. Importing Bike Dataset and Feature Understanding¶

Data Set Information:¶
Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership to rental and return, has become automatic. Through these systems, a user can easily rent a bike at one position and return it at another. Currently, there are over 500 bike-sharing programs around the world, comprising more than 500 thousand bicycles. Today there is great interest in these systems due to their important role in traffic, environmental, and health issues.¶
Apart from the interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data.¶
Attribute Information:¶
Both hour.csv and day.csv have the following fields, except hr which is not available in day.csv¶
    • instant: record index
    • dteday : date
    • season : season (1:winter, 2:spring, 3:summer, 4:fall)
    • yr : year (0: 2011, 1:2012)
    • mnth : month ( 1 to 12)
    • hr : hour (0 to 23)
    • holiday : whether the day is a holiday or not (extracted from [Web Link])
    • weekday : day of the week
    • workingday : 1 if the day is neither weekend nor holiday, otherwise 0.
    • weathersit :
    • 1: Clear, Few clouds, Partly cloudy, Partly cloudy
    • 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
    • 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
    • 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
    • temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
    • atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
    • hum: Normalized humidity. The values are divided to 100 (max)
    • windspeed: Normalized wind speed. The values are divided to 67 (max)
    • casual: count of casual users
    • registered: count of registered users
    • cnt: count of total rental bikes including both casual and registered
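The temp scaling above can be written out directly (hourly-scale bounds t_min=-8, t_max=+39 from the description):

```python
# Min-max normalization used for temp: (t - t_min) / (t_max - t_min)
def normalize_temp(t_celsius, t_min=-8, t_max=39):
    return (t_celsius - t_min) / (t_max - t_min)

print(normalize_temp(-8))    # 0.0 (coldest value in range)
print(normalize_temp(39))    # 1.0 (hottest value in range)
print(normalize_temp(15.5))  # 0.5 (midpoint of the range)
```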
In [60]:
# Import data
f = "hour.csv"
df = pd.read_csv(f)
In [61]:
df.describe()
Out[61]:
instant season yr mnth hr holiday weekday workingday weathersit temp atemp hum windspeed casual registered cnt
count 17379.0000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000
mean 8690.0000 2.501640 0.502561 6.537775 11.546752 0.028770 3.003683 0.682721 1.425283 0.496987 0.475775 0.627229 0.190098 35.676218 153.786869 189.463088
std 5017.0295 1.106918 0.500008 3.438776 6.914405 0.167165 2.005771 0.465431 0.639357 0.192556 0.171850 0.192930 0.122340 49.305030 151.357286 181.387599
min 1.0000 1.000000 0.000000 1.000000 0.000000 0.000000 0.000000 0.000000 1.000000 0.020000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000
25% 4345.5000 2.000000 0.000000 4.000000 6.000000 0.000000 1.000000 0.000000 1.000000 0.340000 0.333300 0.480000 0.104500 4.000000 34.000000 40.000000
50% 8690.0000 3.000000 1.000000 7.000000 12.000000 0.000000 3.000000 1.000000 1.000000 0.500000 0.484800 0.630000 0.194000 17.000000 115.000000 142.000000
75% 13034.5000 3.000000 1.000000 10.000000 18.000000 0.000000 5.000000 1.000000 2.000000 0.660000 0.621200 0.780000 0.253700 48.000000 220.000000 281.000000
max 17379.0000 4.000000 1.000000 12.000000 23.000000 1.000000 6.000000 1.000000 4.000000 1.000000 1.000000 1.000000 0.850700 367.000000 886.000000 977.000000
In [62]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17379 entries, 0 to 17378
Data columns (total 17 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   instant     17379 non-null  int64  
 1   dteday      17379 non-null  object 
 2   season      17379 non-null  int64  
 3   yr          17379 non-null  int64  
 4   mnth        17379 non-null  int64  
 5   hr          17379 non-null  int64  
 6   holiday     17379 non-null  int64  
 7   weekday     17379 non-null  int64  
 8   workingday  17379 non-null  int64  
 9   weathersit  17379 non-null  int64  
 10  temp        17379 non-null  float64
 11  atemp       17379 non-null  float64
 12  hum         17379 non-null  float64
 13  windspeed   17379 non-null  float64
 14  casual      17379 non-null  int64  
 15  registered  17379 non-null  int64  
 16  cnt         17379 non-null  int64  
dtypes: float64(4), int64(12), object(1)
memory usage: 2.3+ MB
  • No null values are present in any column
  • Except for the dteday column, all columns are numeric, so appropriate scaling might be required for a few of them
  • The dteday column will be replaced with a derived day column, since it is a date string (month is already available as mnth)

4. Preprocessing Bike Dataset¶

In [63]:
# drop rows with missing values
df = df.dropna()

# drop duplicate rows
df = df.drop_duplicates()

# drop instant (just a row identifier), yr, and the casual/registered count fields
df = df.drop(['instant', 'yr', 'casual', 'registered'], axis=1)

# derive a new day column from the existing dteday column
df['day'] = pd.DatetimeIndex(df['dteday']).day

# drop the dteday field (keyword form avoids the pandas FutureWarning
# about positional arguments to DataFrame.drop)
df.drop(columns='dteday', inplace=True)

# multiply temp, atemp, hum, and windspeed by 100 to rescale them
df['temp'] = df['temp'] * 100
df['atemp'] = df['atemp'] * 100
df['hum'] = df['hum'] * 100
df['windspeed'] = df['windspeed'] * 100

# add new boolean columns
df['IsSummer'] = (df['season'] == 3)
df['IsWinter'] = (df['season'] == 1)
df['BadWeather'] = (df['weathersit'] > 2)
df['IsWeekend'] = (df['weekday'] == 0) | (df['weekday'] == 6)

# drop the weekday field
df.drop(columns='weekday', inplace=True)
In [64]:
df.describe()
Out[64]:
season mnth hr holiday workingday weathersit temp atemp hum windspeed cnt day
count 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000 17379.000000
mean 2.501640 6.537775 11.546752 0.028770 0.682721 1.425283 49.698717 47.577510 62.722884 19.009761 189.463088 15.683411
std 1.106918 3.438776 6.914405 0.167165 0.465431 0.639357 19.255612 17.185022 19.292983 12.234023 181.387599 8.789373
min 1.000000 1.000000 0.000000 0.000000 0.000000 1.000000 2.000000 0.000000 0.000000 0.000000 1.000000 1.000000
25% 2.000000 4.000000 6.000000 0.000000 0.000000 1.000000 34.000000 33.330000 48.000000 10.450000 40.000000 8.000000
50% 3.000000 7.000000 12.000000 0.000000 1.000000 1.000000 50.000000 48.480000 63.000000 19.400000 142.000000 16.000000
75% 3.000000 10.000000 18.000000 0.000000 1.000000 2.000000 66.000000 62.120000 78.000000 25.370000 281.000000 23.000000
max 4.000000 12.000000 23.000000 1.000000 1.000000 4.000000 100.000000 100.000000 100.000000 85.070000 977.000000 31.000000
In [65]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 17379 entries, 0 to 17378
Data columns (total 16 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   season      17379 non-null  int64  
 1   mnth        17379 non-null  int64  
 2   hr          17379 non-null  int64  
 3   holiday     17379 non-null  int64  
 4   workingday  17379 non-null  int64  
 5   weathersit  17379 non-null  int64  
 6   temp        17379 non-null  float64
 7   atemp       17379 non-null  float64
 8   hum         17379 non-null  float64
 9   windspeed   17379 non-null  float64
 10  cnt         17379 non-null  int64  
 11  day         17379 non-null  int64  
 12  IsSummer    17379 non-null  bool   
 13  IsWinter    17379 non-null  bool   
 14  BadWeather  17379 non-null  bool   
 15  IsWeekend   17379 non-null  bool   
dtypes: bool(4), float64(4), int64(8)
memory usage: 1.8 MB

5. Making a H2O frame from Pandas dataframe¶

In [66]:
df = h2o.H2OFrame(df)

# Response column
y = "cnt"
Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%

6. Splitting the dataset to train and test datasets¶

In [67]:
# Split into train & test
splits = df.split_frame(ratios = [0.8], seed = 1)
train = splits[0]
test = splits[1]
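For intuition, a rough pandas/NumPy equivalent of this split (note that split_frame assigns rows randomly, so the 80/20 ratio is approximate, not exact):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bike data (illustrative only)
toy = pd.DataFrame({"cnt": range(1000)})

# Each row is independently assigned to train with probability 0.8
rng = np.random.default_rng(1)
mask = rng.random(len(toy)) < 0.8
train_pd, test_pd = toy[mask], toy[~mask]

# Every row lands in exactly one split; the ratio is close to, not exactly, 80/20
print(len(train_pd), len(test_pd))
```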

7. AutoML Training¶

In [68]:
# Run AutoML for 1 minute
aml = H2OAutoML(max_runtime_secs=60, seed=1)
aml.train(y=y, training_frame=train)
AutoML progress: |█
22:19:25.645: AutoML: XGBoost is not available; skipping it.
22:19:25.646: Step 'best_of_family_xgboost' not defined in provider 'StackedEnsemble': skipping it.
22:19:25.647: Step 'all_xgboost' not defined in provider 'StackedEnsemble': skipping it.

██████████████████████████████████████████████████████████████| (done) 100%
Model Details
=============
H2OStackedEnsembleEstimator :  Stacked Ensemble
Model Key:  StackedEnsemble_AllModels_1_AutoML_5_20220306_221925

No model summary for this model

ModelMetricsRegressionGLM: stackedensemble
** Reported on train data. **

MSE: 1030.841120734402
RMSE: 32.10671457397038
MAE: 21.299733452897698
RMSLE: NaN
R^2: 0.9688112965309604
Mean Residual Deviance: 1030.841120734402
Null degrees of freedom: 10026
Residual degrees of freedom: 10022
Null deviance: 331409862.9163857
Residual deviance: 10336243.917603849
AIC: 98036.02656529291

ModelMetricsRegressionGLM: stackedensemble
** Reported on cross-validation data. **

MSE: 3267.6658207702208
RMSE: 57.16350077427222
MAE: 37.419970123986694
RMSLE: NaN
R^2: 0.9021007511808254
Mean Residual Deviance: 3267.6658207702208
Null degrees of freedom: 13963
Residual degrees of freedom: 13959
Null deviance: 466158525.07057345
Residual deviance: 45629685.52123536
AIC: 152634.4461305511
Out[68]:

8. AutoML Explanation and Model Comparison¶

In [70]:
# Explain leader model & compare with all AutoML models
exa = aml.explain(test)

Leaderboard

The leaderboard shows models with their metrics. When provided with an H2OAutoML object, it shows 5-fold cross-validated metrics by default (depending on the H2OAutoML settings); otherwise it shows metrics computed on the provided frame. At most 20 models are shown by default.
model_id | mean_residual_deviance | rmse | mse | mae | rmsle | training_time_ms | predict_time_per_row_ms | algo
StackedEnsemble_AllModels_1_AutoML_5_20220306_221925 | 3267.67 | 57.1635 | 3267.67 | 37.42 | nan | 520 | 0.035227 | StackedEnsemble
StackedEnsemble_AllModels_2_AutoML_5_20220306_221925 | 3270.2 | 57.1857 | 3270.2 | 37.4376 | nan | 500 | 0.042155 | StackedEnsemble
StackedEnsemble_BestOfFamily_3_AutoML_5_20220306_221925 | 3340.01 | 57.7929 | 3340.01 | 38.1735 | nan | 483 | 0.031644 | StackedEnsemble
StackedEnsemble_BestOfFamily_2_AutoML_5_20220306_221925 | 3340.75 | 57.7993 | 3340.75 | 38.1798 | nan | 452 | 0.031307 | StackedEnsemble
GBM_3_AutoML_5_20220306_221925 | 3356.91 | 57.9388 | 3356.91 | 38.4177 | nan | 1064 | 0.030055 | GBM
GBM_4_AutoML_5_20220306_221925 | 3386.16 | 58.1907 | 3386.16 | 37.9812 | nan | 1139 | 0.026926 | GBM
GBM_2_AutoML_5_20220306_221925 | 3595.1 | 59.9592 | 3595.1 | 40.0286 | nan | 901 | 0.024981 | GBM
GBM_5_AutoML_5_20220306_221925 | 3793.44 | 61.5909 | 3793.44 | 41.555 | nan | 775 | 0.023318 | GBM
StackedEnsemble_BestOfFamily_1_AutoML_5_20220306_221925 | 3810.33 | 61.7279 | 3810.33 | 41.7574 | nan | 489 | 0.033813 | StackedEnsemble
GBM_1_AutoML_5_20220306_221925 | 3811.23 | 61.7352 | 3811.23 | 41.7434 | nan | 1276 | 0.033502 | GBM
GBM_grid_1_AutoML_5_20220306_221925_model_1 | 4333.11 | 65.8263 | 4333.11 | 44.992 | nan | 780 | 0.02088 | GBM
DRF_1_AutoML_5_20220306_221925 | 4419.64 | 66.4804 | 4419.64 | 43.4485 | 0.442421 | 1926 | 0.01215 | DRF
GBM_grid_1_AutoML_5_20220306_221925_model_2 | 4774.18 | 69.0954 | 4774.18 | 47.7127 | nan | 589 | 0.016533 | GBM
XRT_1_AutoML_5_20220306_221925 | 5867.56 | 76.6 | 5867.56 | 49.7583 | 0.5222 | 1181 | 0.006752 | DRF
GBM_grid_1_AutoML_5_20220306_221925_model_3 | 6105.05 | 78.1348 | 6105.05 | 52.6356 | nan | 165 | 0.004975 | GBM
DeepLearning_1_AutoML_5_20220306_221925 | 13038 | 114.184 | 13038 | 73.527 | nan | 433 | 0.001543 | DeepLearning
GBM_grid_1_AutoML_5_20220306_221925_model_4 | 18872.3 | 137.376 | 18872.3 | 105.372 | 1.35474 | 47 | 0.000899 | GBM
DeepLearning_grid_1_AutoML_5_20220306_221925_model_1 | 19913.7 | 141.116 | 19913.7 | 99.0284 | nan | 1074 | 0.003413 | DeepLearning
GLM_1_AutoML_5_20220306_221925 | 21745.6 | 147.464 | 21745.6 | 108.853 | nan | 194 | 0.000293 | GLM

Residual Analysis

Residual analysis plots the fitted values vs. residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using a simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer-valued (vs. a real-valued) response variable.
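The quantity on the y-axis of that plot is just the residual, actual minus fitted; a minimal sketch with made-up numbers:

```python
# Residual = actual - fitted, computed for a few made-up predictions
actual = [120, 45, 300, 10]
fitted = [110.0, 52.0, 290.0, 18.0]

residuals = [a - f for a, f in zip(actual, fitted)]
print(residuals)  # [10.0, -7.0, 10.0, -8.0]
```

Ideally these values scatter randomly around zero when plotted against the fitted values.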

Variable Importance

The variable importance plot shows the relative importance of the most important variables in the model.

Variable Importance Heatmap

The variable importance heatmap shows variable importance across multiple models. Some models in H2O return variable importance for one-hot (binary indicator) encoded versions of categorical columns (e.g. Deep Learning, XGBoost). So that the variable importance of categorical columns can be compared across all model types, we compute a summarization of the variable importance across all one-hot encoded features and return a single variable importance for the original categorical feature. By default, the models and variables are ordered by their similarity.

Model Correlation

This plot shows the correlation between the predictions of the models. For classification, frequency of identical predictions is used. By default, models are ordered by their similarity (as computed by hierarchical clustering). Interpretable models, such as GAM, GLM, and RuleFit are highlighted using red colored text.

SHAP Summary

SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.

Partial Dependence Plots

A partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. The PDP assumes independence between the feature for which the PDP is computed and the rest of the features.


Individual Conditional Expectation

An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
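The PDP/ICE distinction can be sketched with the same kind of toy setup (illustrative only):

```python
# An ICE curve is the prediction for ONE row as the feature varies;
# the PDP is the average of the ICE curves over all rows
rows = [{"temp": 0.3, "hum": 0.5}, {"temp": 0.9, "hum": 0.25}]

def toy_model(row):  # stand-in for a trained model
    return 100 * row["temp"] - 20 * row["hum"]

def ice_curve(row, feature, grid):
    return [toy_model(dict(row, **{feature: g})) for g in grid]

grid = [0.0, 0.5, 1.0]
curves = [ice_curve(row, "temp", grid) for row in rows]
pdp = [sum(vals) / len(vals) for vals in zip(*curves)]
print(curves[0])  # ICE for the first row:  [-10.0, 40.0, 90.0]
print(curves[1])  # ICE for the second row: [-5.0, 45.0, 95.0]
print(pdp)        # PDP (their average):    [-7.5, 42.5, 92.5]
```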

9. Explaining Leader Model from AutoML¶

In [ ]:
# Explain a single H2O model (e.g. leader model from AutoML)
exm = aml.leader.explain(test)

Residual Analysis

Residual analysis plots the fitted values vs. residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using a simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer-valued (vs. a real-valued) response variable.

Partial Dependence Plots

A partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured as the change in the mean response. The PDP assumes independence between the feature for which the PDP is computed and the rest of the features.


Individual Conditional Expectation

An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.


10. Explainability¶

  • The y-axis of the SHAP summary plot lists the variable names in order of importance from top to bottom; the value next to each name is its mean SHAP value.
  • The x-axis shows the SHAP value: how much the feature shifts the model's raw prediction for that row (for this regression task, the predicted rental count).

  • We can see that when the weather is bad there are fewer bike rentals; this shows up as large negative SHAP values for BadWeather.

  • We can also see that when the windspeed is high, bike rentals are lower. This is intuitive and is likewise seen from large negative SHAP values.

  • When the temperature is slightly higher there are more bike rentals, but as soon as temp decreases there is a sharp reduction in rentals.
